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Abstract We investigate efficient algorithmic realisa- 
tions for robust deconvolution of grey- value images with 
known space-invariant point-spread function, with em- 
phasis on ID motion blur scenarios. The goal is to make 
deconvolution suitable as preprocessing step in auto- 
mated image processing environments with tight time 
constraints. Candidate deconvolution methods are se- 
lected for their restoration quality, robustness and effi- 
ciency. Evaluation of restoration quality and robustness 
on synthetic and real- world test images leads us to focus 
on a combination of Wiener filtering with few iterations 
of robust and regularised Richardson-Lucy deconvolu- 
tion. We discuss algorithmic optimisations for specific 
scenarios. In the case of uniform linear motion blur in 
coordinate direction, it is possible to achieve real-time 
performance (less than 50 ms) in single-threaded CPU 
computation on images of 256 x 256 pixels. For more 
general space- invariant blur settings, still favourable com- 
putation times are obtained. Exemplary parallel imple- 
mentations demonstrate that the proposed method also 
achieves real-time performance for general ID motion 
blurs in a multi-threaded CPU setting, and for general 
2D blurs on a GPU. 



1 Introduction 

Besides noise, blur is perhaps the most common type 
of degradation in a wide range of imaging modalities. 
Sources of blur vary widely, from atmospheric pertur- 
bations via optical aberrations, motion of both objects 
and imaging appliances down to defocussing. Common 
to all kinds of blur is that it interferes with the detection 
of localised image features, and thereby impedes inter- 
pretation of images, be it by humans or by automated 
systems. 

It has therefore been a long-standing goal of image 
processing research to remove blur from so degraded im- 
ages. For this task, deconvolution, numerous approaches 
have been designed that differ in the requirements they 
impose on the input data, the quality of results, and the 
computational effort. First work in the context of ID 
signal processing goes back almost eighty years [T7] . 
Approaches studied since then range from fast linear 
filters and Fourier-based methods like the Wiener filter 
[22] to non-linear iterative schemes derived from vari- 
ational and/or statistical models [2l[6lfTn[T2l [T4 l [T8 l fT9] . 
All sorts of deconvolution problems are highly ill-posed, 
and there is inevitably a trade-off between restoration 
quality and speed of computation. 
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Basic classification of deconvolution tasks. In the clas- 
sification of deconvolution problems, the most impor- 
tant dichotomy is that between non-blind and blind de- 
convolution. Whereas in the first case the blur is known, 
blind deconvolution assumes that only the observed im- 
age is available, and the blur is to be estimated along 
with the sharp image. As blind deconvolution is much 
more underconstrained than deconvolution with known 
blur, it involves higher computational cost and often 
generates poorer results. 



2 



M. Welk, P. Raudaschl, T. Schwarzbauer, M. Erler, M. Lauter 



A second distinction is that between space-invariant 
blur, in which the so-called point-spread function is the 
same for all locations in the image domain, and space- 
variant blur, where different pixels are smeared in differ- 
ent ways. In the space-invariant case, the blur operator 
reduces to a standard convolution, such that Fourier 
methods can be used, and also forward (blur) opera- 
tions in iterative schemes can be computed by the Fast 
Fourier Transform. 

Towards real-time deconvolution. In this contribution, 
we consider robust deconvolution with known space- 
invariant blur under tight time constraints, with the 
aim to achieve real-time performance at least in some 
cases. We will therefore select deconvolution methods 
for robust deconvolution performance, and discuss mea- 
sures to optimise their algorithmic efficiency. 

Real-time deconvolution has also been considered in 
some recent papers, see PITU]. Both papers address the 
case of general 2D blur on the basis of the algorithm 
from [llj . and reach real-time performance when us- 
ing GPUs with several hundred computing units under 
CUDA. In [5], even blind deconvolution is considered; 
however, the blind deconvolution case is far from real- 
time performance even when computing on a GPU. 

Comparing to these contributions, however, our goal 
is different. We want to investigate under which condi- 
tions real-time performance can be achieved in single- 
threaded CPU computation. This was an essential re- 
quirement of the application context from which this 
work emerged. Additional work on multi-threaded CPU 
and GPU realisations is presented to put the results in 
context. 

To accomplish this task, we are willing, in turn, to 
sacrifice some generality: We focus on setups in which 
some parameters can be chosen in a way such as to 
reduce the computational expense. For example, we ac- 
cept the limitation to power-of-two image dimensions 
such as to profit best from Fast Fourier Transform. 
Also, we consider settings restricted to either one-di- 
mensional blur, or even more specifically just linear 
motion blur, and assume herein that the camera sys- 
tem can be adjusted in such a way that the ID blur is 
in the direction of a coordinate axis of the sensor. 

Robustness. On the other hand, with practical applica- 
tions of deconvolution as goal, robustness is a crucial 
aspect for our investigation. The concept of robustness 
originates from statistics [9] and refers to estimators 
that are as insensitive as possible to outliers and model 
violations. In the case of deconvolution, this includes 
not only strong noise (e.g., salt-and-pepper noise in [5]) 



but also imprecise blur estimates, or errors near the im- 
age boundary caused by blurring across the boundary, 
see [2T] . 

There is a long tradition of testing deconvolution al- 
gorithms with synthetically blurred images. Many pa- 
pers, including [TUlfTTll^ . use only test cases of this 
kind. Often also the noise levels being considered are 
fairly low (in the range of the quantisation noise al- 
ready caused by discretising grey- values to integers in 
[0,255]). Such test scenarios have their merit because 
they allow to measure reconstruction error against a 
ground truth. However, it is particularly dangerous in 
the case of such a severely ill-posed problem like de- 
convolution to rely only on tests with synthetic blur. 
Methods that perform well on such data do often not 
live up to their promises when applied to real-world 
images whose blur or noise deviates slightly from the 
perfect theoretical model underlying the deconvolution 
approach. 

Application context. The motivation for the work pre- 
sented in this paper is to devise a deconvolution frame- 
work that can be used as preprocessing step for further 
automated image processing tasks under real-time con- 
ditions within industrial processes. It is important to 
note that the embedding into an industrial process also 
limits the computational resources that are available for 
computation. While parallelisation on multicore CPUs 
and GPUs plays an ever-increasing role in current high- 
performance computing, an essential constraint of the 
application context that motivated the present work 
was that not more than one CPU kernel could be as- 
signed to the image processing task. 

The setting under consideration involved imaging 
moving objects by stationary cameras. The resulting 
motion blur is to be compensated by deconvolution in 
order to allow the detection and localisation of features 
on moving objects as well as recognition of shapes. To 
this end, a sufficient sharpening of edges and lines would 
be necessary. Ringing artifacts should be suppressed 
sufficiently in order to not create false detections. Fi- 
nally, in order to be feasible together with subsequent 
processing tasks in real-time, deconvolution should not 
take more than about 50 ms of single-threaded compu- 
tation on a contemporary standard CPU. 

In addressing this problem, setups of different de- 
gree of generality were considered: firstly, blur by uni- 
form linear motion; secondly, blur by non-uniform lin- 
ear motion; and thirdly, a general 2D blur model. In 
the linear motion cases, it was assumed that the blur 
direction would be aligned with a coordinate axis of the 
imaging system by technical measures. In all cases, it 
was assumed that the imaging conditions would be suf- 
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ficiently controllable to ensure that the blur would be 
known, so no blind deconvolution was considered. 



2 Mathematical Models for Deconvolution 

To describe deconvolution models that were investi- 
gated in order to accomplish the task outlined above, we 
start from the common spatially invariant blur model 
for grey-value images, 



2.2 Richardson-Lucy Deconvolution 

One of the most time-proven and popular iterative de- 
convolution method is Richardson-Lucy (RL) deconvo- 
lution ^MM 



,k+l 



f 



(3) 



f{x) = {g * h){x) + n{x) , 



(1) 



The single parameter of this method is the number of 
iterations. Unlike the Wiener filter, RL requires and 
preserves positivity of grey-values, and is related to 
a Poisson noise model. The method features a semi- 
convergence behaviour: With increasing number of it- 
erations, sharpness is improved, but at the same time 
noise is amplified and leads to divergence after an initial 
convergence phase. 



where / is the observed blurred image, g is the un- 
observable sharp image, h the (known) space-invariant 
point-spread function (PSF), and n additive noise. By 

X = {x,y) e M2 we denote the location in the image 2.3 Variational Models for Deconvolution 
plane. 

In the following, u denotes the processed image that 
is to approximate g. In iterative methods, we denote the 
fc-th iterate by u*^. Some methods involve convolution 
with h*, the adjoint of h. In our space- invariant setting, 
h* is just h reflected at the origin, i.e. h*{x) = h{—x). 



A larger class of iterative methods arises from varia- 
tional models. In these, one aims at minimising energy 
functionals pil^ that combine a so-called data term 
which penalises deviations from the blur model with 
a regulariser enforcing some smoothness constraint on 
the image. Such an energy functional can read as [2ll21] 



2.1 Wiener Filter 



)dx 



(4) 



The Wiener filter [52] filters the Fourier transform / 
of / to approximately invert the multiplication with h 
that corresponds to the convolution with h\ it reads 



K 



(2) 



with the filter parameter K > Q that dampens the am- 
plification of those frequencies for which \h\ is zero or 
small. Note that h. the complex conjugate of /i, is the 
Fourier transform of h* . Generally, the sharpening ef- 
fect of the filter is the more pronounced, the smaller 
K is chosen. However, smaller K also implies stronger 
amplification of noise, such that larger K must be used 
when strong noise is present. Using a Gaussian noise 
model, K can be chosen dependent on the standard de- 
viation of noise such as to minimise the mean square 
error of the approximation of g by m. As a single-step 
method, the Wiener filter is fast, but its applicability is 
limited since it can hardly cope with more severe noise, 
and like all linear methods tends to produce oscilla- 
tory over- and undershoots near contrasted structures, 
known as ringing artifacts. 



where ^ : M([ are non-decreasing differen- 

tiable functions. Both contributions are balanced by a 
positive weight parameter a. 

Besides the identity function <P(s^) = s^, which cor- 
responds to a quadratic error measure, a common choice 
for the pcnaliser function <1> is the regularised error 
<?(s^) = Vs^ + with a small e > 0, see e.g. [5]. The 
same type of function can be used for The resulting 
total variation penaliser |15| is popular in image pro- 
cessing due to its favourable edge-preserving behaviour. 
In order to even enhance edges, one can use even pe- 
nalisers !f'(s^) that are non-convex w.r.t. s, see [T1[5D]. 
Quadratic penalisation in the regulariser is rarely used 
nowadays in deblurring because of its blurring effect 
that directly counteracts the data term. 

The objective function in |11| . motivated there from 
statistical considerations, is a discrete version of the 
energy functional 



E[u] 



(5) 



which obviously amounts to an instance of @ with the 
quadratic data penalty <P{s'^) = s^, and a non-convex 
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power function as regulariser. With regard to the spe- 
cific optimisation that is employed for minimisation in 
[TT] . special values like a = 2/3,3/4,4/5 arc preferred. 
Also in |18| an energy of this type is minimised but with 
total variation regularision, i.e. a — I. 
In [in], the energy functional 



E[u]= j <^{rf{u*h)) + a^(\Vu\^)dx 



(6) 



is introduced whose data term does not depend on (/ — 
u*h) but instead of the so-called information divergence 



Tf{s) :=s-/-/ln(s//). 



(7) 



A simpler version of the functional © can also be linked 
to the Richardson-Lucy method [IB] . 



Techniques for energy minimisation. A gradient descent 
for Q can be derived using the Euler-Lagrange frame- 
work. Discretising in time by an Euler forward scheme 
with time step size r > one arrives at the iteration 



= + t( (<?'((/ -u* hf) ■{f~u*h))*h* 

+ adiv{l''{\Vu\^) -Vu)) (8) 



,fc+i 



whose update term combines a sharpening term (first 
summand) that acts to reduce the data term with a 
nonlinear diffusion term (second summand) with the 
diffusivity that enforces image regularity, compare 
pi] . The functions Rq R+ are nonincreasing. 

Unfortunately, gradient descent is a fairly slow pro- 
cedure, and thus presently not a candidate for decon- 
volution under real-time or near-real-time conditions. 

The half-quadratic optimisation technique used in 
|lllll8j is another approach to minimising an energy 
approach. The nonlinearity in the regulariser is decou- 
pled from the minimisation of the data term by means 
of additional variables and a coupling term, weighted 
by an additional parameter /3. Following a continuation 
strategy, /? is successively increased during the compu- 
tation. For details see PTIITB] . 

For the energy functional ([6]) a positivity-preserving 
iterative minimisation scheme in the style of the RL 
method can be derived [71[T^. which is called robust 
and regularised RL deconvolution (RRRL) and reads 
as 



,k+l _ ( 



(9) 



where 

W^'^ :=<?>'(r/(u'=*/i)) , (10) 
D^'(m'^) :=div(<P''(|Vu'=|2)VM'=) , (11) 

[z]±:=liz±\z\). (12) 

The numerical realisation of these methods will be 
considered in Section HTT] 

In theory, the different noise models underlying de- 
convolution models, such as Gaussian noise for the Wie- 
ner filter, Poisson noise for RL etc., render these models 
suitable for distinct application areas. In practical ap- 
plication, however, often not all sources of noise are 
known and controllable, such that it is unavoidable 
to apply deconvolution methods also to data that do 
not perfectly match their respective noise models. In- 
deed, robustness, as mentioned in the introduction, is 
all about models being capable of handling this kind of 
mismatch. 

3 Evaluation of Deconvolution Quality 

To assess the suitability of the before-mentioned ap- 
proaches for our deconvolution task, we start by evalu- 
ating their restoration quality. Gradient descent is not 
considered further because of its slowness mentioned 
earlier. Apart from that, efficiency optimisation is still 
not the goal at this point. All tests in this section are 
therefore based on deconvolution implementations for 
general 2D blurs with Fourier convolutions. 

We compare thus Wiener filter, Richardson-Lucy 
deconvolution (RL), the iterative methods by Wang et 
al. (WYYZ) [IH] and by Krishnan and Fergus (KF) [TT], 
and RRRL [TJUS]. 

3.1 Synthetically Blurred Images 

Our first test scenario is based on a version of the pop- 
ular cameraman test image which has been synthet- 
ically blurred with a space- invariant "camera-shake" - 
like point-spread function of irregular shape. Fig. [TJa). 
No noise besides the quantisation noise is present in 
this test image. 

Since the deconvolution methods are implemented 
with Fourier domain convolution, the blur in the test 
images has been carried out via the spatial domain in 
order to avoid what is known as an "inverse crime" [S] . 
The boundary condition in the convolution was chosen 
as constant continuation (along normals to the bound- 
ary) which contrasts to the periodic continuation im- 
plicitly involved in the Fourier convolution in the de- 
blurring algorithms. 



Fast and Robust Linear Motion Deblurring 



5 




Fig. 1 Top row, left to right: (a) Cameraman image (256 X 256 pixels) synthetically blurred with a small irregular point- 
spread function (camera-shake type) using spatial domain convolution with constant continuation. Insert shows PSF (four 
times enlarged). - (b) Deblurred by Wiener filtering, K = 0.006. - (c) Deblurred by 30 iterations of RL. - (d) Deblurred by 
the KF method | 11| with a = 2/3, A = 50 and /3 from 1 to 128\/2 in powers of 2\/2, 1 iteration per level. Bottom row: (e) 
Deblurred by the WYYZ method [T5] with A = 50 and /3 from 1 to 128v^ in powers of 2^2, 1 iteration per level. - (f) 5 
iterations of RRRL, a = 0.003, with TV regulariser. - (g) 30 iterations of RRRL, same parameters as (f). - (h) 100 iterations 
of RRRL, same parameters as (f). 



Table [T] collects signal-to- noise ratios 



3.2 Combined Wiener-RRRL Method (WR^L) 



SNR(w,uo) = 10 logi 



var (it) 
var (u — uq) 



dB 



for blurred and deblurred images u compared to the 
unperturbed cameraman image uq. 

In Fig. H we test Wiener filter, KF and RRRL on 
three other synthetically blurred cameraman images to 
study the stability of results under additional noise and 
stronger blur. For signal-to-noise ratios see again Ta- 
ble [TJ In Fig. IHa-d), a fairly low amount of Gaussian 
noise is added to the test image. Subfig. (e-h) consider a 
spatially more extended point-spread function. A more 
drastic noise - impulse noise with 15 % density - is 
added to the first test image in Subfig. (i-1). 

It is evident both visually and from the SNR figures 
that RRRL, KF, and WYYZ generally allow for a good 
restoration. The Wiener filter is sensitive to boundary 
artifacts for larger blurs, and to non-Gaussian noise. 
The robust data term of the RRRL model gives it also 
an advantage over the KF and WYYZ methods in set- 
tings with boundary artifacts and more severe noise. 



It should be noted that in spite of its favourable proper- 
ties RRRL still takes fairly many iterations (30. . . 100, 
depending on noise level) to achieve an acceptable de- 
gree of sharpness. On the other hand, Wiener filter- 
ing, being a non-iterative method, provides a reason- 
able sharpness in one fast computation step but at the 
cost that the remaining noise and ringing artifacts are 
more pronounced. 

This motivates us to test a combined approach in 
which the Wiener filter is used as a first step, followed 
by an RRRL iteration for which the Wiener result acts 
as initialisation. A caveat in doing so is that the Wiener 
filter output can contain zero or negative grey-values, 
which cannot be handled within the RRRL model. How- 
ever, negative grey-values appear as part of artifacts 
anyway, so this can be countered simply by replacing 
all zero or negative grey- values to a small positive value 
in the input for RRRL. 

Table [T] includes this method, WR^L, in the last col- 
umn. Note that even with as few as 5 iterations WR'^L 
performs in most cases comparable to about 30 itera- 
tions of pure RRRL. Visual evaluation of WR'^L will be 
included in subsequent experiments. 



6 



M. Welk, P. Raudaschl, T. Schwarzbauer, M. Erler, M. Lauter 




Fig. 2 Top row, left to right: (a) Cameraman image synthetically blurred with a motion blur of 21 pixels length in 45° 
direction using constant continuation. Insert shows PSF (true size). - (b) Wiener filter, K = 0.006. - (c) KF method, same 
parameters as in Fig. [IJc) except for A = 3. - (d) 30 iterations of RRRL, same parameters as in Fig. [TJh). Middle row, left 
to right: (e) Blurred cameraman image from Fig. ^a) with additive Gaussian noise, ct = 5. Insert shows PSF (four times 
enlarged). - (f) Wiener filter, K = 0.06. - (g) KF method, same parameters as in Fig. [TJc) except for A = 25. - (h) RRRL 
as in (d). Bottom row, left to right: (i) Blurred cameraman image from Fig. ^a), with 15% of the pixels replaced by noise 
pixels with uniform distribution on [0,255]. Insert shows PSF (four times enlarged). - (j) Wiener filter, K = 0.16. - (k) KF 
method, same parameters as in Fig. [TJc) except for A = 1. - (1) RRRL as in (d). 



3.3 Restoration Quality for Real- World Images 



We turn now to real- world examples, taken under con- 
ditions similar to the industrial production context that 
our development is directed at. From here on, we have 
to rely on visual comparisons since a ground truth from 
which SNR could be computed is no longer available. 
Also, no exact knowledge on the noise distribution is 
available. On the other hand, the motion parameters 
determining the blur can be adjusted in these setups, 
such that the point-spread function is known. Moreover, 
the setting with objects being imaged before a uniform 
background largely removes boundary artifacts - in par- 
ticular, periodic continuation is unproblematic in this 
case. 



In Figs. [3] and m we present two real- world test im- 
ages with linear uniform motion blur. Deblurring results 
using Wiener filter, RL, KF, RRRL visually confirm 
the findings from our synthetic blur experiments. The 
Wiener filter allows for a visually favourable sharpen- 
ing that can be equalled by RL and RRRL only after 
about 30 iterations. However, amplified noise and arti- 
facts appear more prominent in the Wiener filter result. 
The combination WR'^L is studied in Subfigures (f- 
h) of both figures. Visually, the difference between the 
Wiener filter result, Fig. [21Jb)/1Ub), and the result af- 
ter additional 5 iterations of RRRL, Subfig. (f), ap- 
pears to be small. The exaggeration with 30 iterations 
in Subfig. (g) and the difference image in Subfig. (h) 
demonstrate that there is an improvement, though: the 
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Fig. 3 Top row, left to right: (a) Original image (256 x 256 pixels) of a prosthetic tooth with motion blur (blur length: 
27 pixels). Courtesy of Westcam Projektmanagement GmbH. - (b) Filtered by Wiener filter, K = 0.006. - (c) Filtered by 30 
iterations of RL. - (d) Filtered by KF method, A = 200, /9 from 1 to 128v^ in multiplicative steps of 2\/2, 1 iteration per 
level. Bottom row, left to right: (e) Filtered by 30 iterations of RRRL with TV regulariser, a = 0.003. - (f) WR^L: Wiener 
filtering and 5 iterations of RRRL with TV regulariser, a = 0.003. - (g) Same but 30 iterations. - (h) Difference image of (b) 
and (f), range [-10,10] rescaled to [0,255]. 




Fig. 4 Top row, left to right: (a) Original image (256 x 256 pixels) with motion blur in vertical direction (blur length: 27 
pixels). Courtesy of Datacon GmbH. — (b) Filtered by Wiener filter, K = 0.006. - (c) 30 iterations of RL. - (d) KF method, 
A = 100, /9 from 1 to 128v'2 in multiplicative steps of 2\/2, 1 iteration per level. Bottom row, left to right: (e) 30 iterations 
of RRRL, a = 0.003. - (f) WR^L with 5 iterations of RRRL. - (g) WR^L with 30 iterations of RRRL. - (h) Difl'erence image 
of (b) and (f), range [-10,10] rescaled to [0,255]. 



RRRL iterations keep the good initial sharpening of Ringing artifacts and noise are shghtly more prominent 
the Wiener fiher, while noise and artifacts are reduced. in the KF result than in the RRRL or WR^L results, 
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Fig. 5 Left to right: (a) Original image from Fig.UJa) with Gaussian noise, a = 10. - (b) Filtered by Wiener filter, K = 0.06. 
- (c) Filtered by KF, A = 3, /9 from 1 to 128\/2 in multiplicative steps of 2>/2, 1 iteration per level, (d) Wiener filter and 5 
iterations of RRRL, a = 0.003. - (e) Only 5 iterations of RRRL. 



which underlines that robust data terms are indeed ad- 
vantageous in real-world data. 

The effect of the combination WR^L becomes even 
more evident in Figure [5l where the original image has 
been degraded by additional Gaussian noise: Here, it is 
clear that the combination of Wiener filter and RRRL 
combines the sharpness achieved by a single Wiener 
filter step (but not by RRRL alone) with a visible noise 
reduction. 

Summarising the observations of this section, the 
combination of Wiener filter with at least five subse- 
quent iterations of RRRL provides a reasonable restora- 
tion quality. Our efficiency considerations will therefore 
focus on this method. 



4 Efficient Implementation 

In this section, we discuss aspects of the algorithmic 
realisation of the deconvolution methods under consid- 
eration. Wiener filter, RL, RRRL and the combined 
WR^L were implemented in three scenarios from gen- 
eral 2D blur down to ID linear motion blur. 

For WYYZ and KF only general 2D implementa- 
tions were done to allow comparisons. Our strategies 
to derive more efficient ID algorithms from the Wiener 
and RL filter family cannot be transferred straightfor- 
ward to the WYYZ and KF algorithms due to the dif- 
ferent structure of their iterations, in which the 2D reg- 
ularisation is intertwined with the 2D Fourier iteration 
step. 

While our focus is on single-threaded CPU compu- 
tation, we did also multi-threaded CPU and GPU im- 
plementations of some variants, which will also shortly 
be introduced. 



necessary for the Fourier transform, convolution oper- 
ations and the diffusion terms. In the case of RRRL, 
also the computation of the information divergence ([7]) 
deserves special attention. 

Fourier transform. For full control over implementa- 
tion details, we used our own implementation of the 
Fast Fourier Transform (FFT) for powcr-of-two image 
dimensions, based on |31 Par. 4.4.1.3], with adaptations 
to the rcal-valuedness of image data. The values of 
the complex exponential were prccomputcd once for all 
Fourier transforms during the program run. 

Convolution. Here we considered either realisations via 
the Fourier domain, or spatial-domain convolution by 
direct summation. 

Diffusion terms. These were discretised using standard 
central difference approximations. 

Function evaluations. In the RRRL iteration, the infor- 
mation divergence ([7]) is expensive to compute directly 
due to the logarithms. For this reason, values of ri(s) in 
the range s G [0, 65] were stored in a lookup table. Val- 
ues of r}{s) were then calculated by r/(s) = f ri{s/f), 
using a linear approximation for ri in (65, oo). (In our 
test examples, the latter approximation did in fact not 
occur.) 

In the KF(-S) algorithm the solution of a polyno- 
mial equation can be replaced by a lookup table. In 
[TT], speedups from about 1.7 (for 256 x 256 images) to 
4 (for large images) were obtained in this way. In our im- 
plementation, we stick to the slower analytic solution. 
However, by comparing with the WYYZ algorithm the 
possible speedup can be estimated. 



4.1 Numerics 

The numerical realisation of all deconvolution methods 
discussed is mostly straightforward, specifications being 



4.2 Boundary Treatment 

Since blur operations involve considerable transport of 
information across the image boundary, artifacts near 
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the image boundary are an issue in deconvolution. These 
artifacts are especially strong when simple boundary 
conditions like zero-padding, constant or periodic con- 
tinuation introduce massive model violations. Although 
advanced boundary treatment schemes are available that 
allow to reduce these artifacts considerably, see e.g. [1], 
these are computationally expensive. 

For performance reasons, we decide to use constant 
continuation for spatial convolution operations, and the 
natural periodic continuation for Fourier-based opera- 
tions, and tolerate the resulting artifacts. Note that in 
the application setting of Figs. [3H5] where objects are 
photographed in front of uniform backgrounds, almost 
no artifacts are introduced anyway. 



4.3 Algorithmic Optimisations 

Concluding from the qualitative tests, it is desirable 
to perform Wiener filtering plus at least five iterations 
of RRRL for a practically useful deconvolution in the 
real-time environment. Concerning the image size, we 
restrict ourselves to powers of two in order to profit 
most from the Fast Fourier Transform (FFT). Depend- 
ing on the exact algorithmic variant this restriction can 
be eased in application. 

In our examples, an image size of 256 x 256 is appro- 
priate to include a suitable region of interest for object 
detection or localisation. For more limited applicability, 
also images of size 128 x 128 may be considered. 

We discuss now algorithmic optimisations that can 
be used to achieve the desired deconvolution in the so 
defined setting. We will distinguish herein between uni- 
form linear motion and non-uniform linear motion, and 
compare both to a general 2D space-invariant blur sit- 
uation. 



General 2D setting. It is clear that in the general 2D 
blur scenario, Wiener filtering requires a two-dimensio- 
nal Fourier transform and inverse transform, which are 
implemented by FFT. 

In the iterative algorithms (RL, RRRL), convolu- 
tions can either be computed by direct numerical inte- 
gration in the spatial domain, or again via the Fourier 
domain. In the latter case, once more a 2D FFT is nec- 
essary. As the complexity of the spatial domain convo- 
lution is linear in the image size and PSF size, while the 
FFT implementation is log-linear in the image size only, 
spatial domain convolution may be superior to FFT for 
very small PSF but FFT will dominate for large convo- 
lution kernels. 



Non-uniform linear motion. In this case the blur is 
characterised by a point-spread function with ID sup- 
port. Assuming that it is aligned with a coordinate axis 
of the imaging device (say the vertical one), the Wiener 
filter needs to applied only in columns. Thus, a ID FFT 
is sufficient, saving up to half the numerical cost of the 
Fourier transform. 

Equally, only ID convolutions are needed in the it- 
erative method. If these are carried out in ID only, the 
program logic is slightly simplified compared to the 2D 
case but the number of multiplications and additions 
in evaluating the integral will be comparable to that of 
a 2D implementation for equal PSF size. In contrast, 
a Fourier-based realisation will again profit from using 
ID instead of 2D FFT. 

Uniform linear motion. The point-spread function in 
this case still is ID, but it takes the special form of 
a box filter, i.e. it is constant throughout its support. 
(This applies exactly when the blur length is integer. In 
the case of a non-integer blur length, one has to allow 
for single pixels of smaller weight at one or both ends 
of the kernel.) 

For the Wiener filter, this setting does not offer any 
specific advantage over the general ID case. The same 
holds true for the convolutions in the iterative method 
when computed via the Fourier domain. 

However, the spatial domain convolution can be made 
dramatically more efficient in this case by an efficient 
box filtering algorithm [T3]. In the case of a vertical 
linear blur this algorithm works as follows: In each col- 
umn, the convolution of the first pixel is computed by 
direct summation (complexity linear w.r.t. the kernel 
size). Then the sliding window is shifted one pixel at 
a time, such that one pixel enters the window while 
one pixel leaves it. So the sum inside the window is up- 
dated in constant time by adding one grey- value and 
subtracting another one. If non-integer blur length is 
admitted, the update step involves at most two addi- 
tions and two subtractions. The overall complexity of 
the update part is therefore linear in the image size, and 
the total complexity of the algorithm with blur kernel 
size m on an image of columns and Uy rows amounts 
to 0(nx ■ {riy + m)). 

4.4 Implementation and Technical Optimisation 

All CPU programs were written in C, and compiled us- 
ing gee 4.6 with optimisation level 02. With 03 some 
run times were further reduced by a few percent while 
others even increased slightly. Also more specific com- 
piler optimisation settings did not significantly improve 
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performance. However, efficiency was improved by stan- 
dard source code optimisation techniques like inlining, 
loop merging, blocking (for cache optimisation). Wher- 
ever possible, array access operations were avoided us- 
ing auxiliary variables. One-dimensional arrays were 
used to store image data. 

4.5 Parallelisation 

While our focus is on single-threaded CPU computa- 
tion, we consider two exemplary parallel implementa- 
tions to assess the possible gains by parallelisation on 
muhicore CPUs and CPUs. 

Multi-threaded CPU implementation, ID. We imple- 
mented the version of RRRL for general ID motion 
blur for multi-threaded computation on a multi-core 
CPU using the standard pthreads library. The Wiener 
filter was not parallelised in this setting because it con- 
tributes little to the overall run-time. 

The sharpening term of the RRRL iteration nicely 
decomposes in this case in columns parallel to the point- 
spread function, and is easily parallelised in this way. In 
contrast, the smoothing term is still made up by 2D dif- 
ferential expressions, so the columns interact with each 
other in the derivative computation. Since the overall 
computational cost of the sharpening term equals sev- 
eral times that of the smoothing term, an almost bal- 
anced workload is achieved on CPUs with 4. . . 8 cores if 
the regulariser is computed in the parent thread, while 
the sharpening term is distributed to the remaining 
cores in parallel threads. 

To reduce the overhead of creating and terminating 
threads, the parallel worker threads for the sharpening 
term are started before the first RRRL iteration, and 
not terminated before all iterations are done. Mutexes 
are used to synchronise the updates between worker 
threads and parent thread in each iteration. 

GPU implementation, 2D Fourier. We also implemen- 
ted the WR^L algorithm for general 2D point-spread 
functions using the CUDA 4.0 framework for Nvidia 
CPUs. Both the Wiener filter and the convolutions in 
the RRRL iteration are performed using CUDA's built- 
in Fast Fourier transforms. Efficiency of parallel access 
to some data (namely, the point-spread function) was 
improved by using texture memory. 

5 Experiments 

We measure the performance of deconvolution algo- 
rithms suitable for the three scenarios described in the 



previous section. For this, we used the gettimeof day () 
function since it states real-world run times within the 
context of the running system instead of pure process 
time, and easily allows measurements for any particular 
portion of an algorithm. The downside is that due to 
other activities (system processes etc.) run times will 
display considerable variation ~ we used a standard 
Linux system without any specific real-time scheduler. 
It is therefore necessary to consider statistics over many 
program runs. We report therefore averages, standard 
deviations and extremes from 100 subsequent program 
runs. 

Not included in our measurements is the time for 
loading and storing images. The rationale for this is 
that in time-critical industrial applications, image data 
would anyway be transferred into the memory directly 
from the imaging device. 

Also not included is the time for precomputing aux- 
iliary data for the Fourier transform. This is based on 
the assumption that this can be done once for a large 
number of equally sized images to be processed in an 
application context. 

5.1 Single-Threaded Wiener+RRRL 

Comparison across 1D/2D blur scenarios. We choose 
as our test case the image from Fig. 2] (256 x 256 pixels) 
with the uniform linear motion blur kernel of length 27, 
for which all algorithms could equally be applied. We re- 
mark that for the algorithms chosen (box filter, Fourier 
convolution/ Wiener filtering) the computational cost 
docs not or only slightly depend on the size of the blur 
kernel. In all cases, we compute Wiener filtering plus 
five iterations of RRRL. 

Run times were measured for single-threaded com- 
putation (thus using one core) on an AMD Phenom II 
X6 HOOT running at 3.3 GHz. Statistics for the Wiener 
step, single RRRL iterations, total run time of the five 
RRRL iterations and overall time are given in Table [5] 
Note that the Fourier transform of the point-spread 
function is computed once in the Wiener filter step and 
re-used if necessary in the RRRL iterations. 

From this table it is evident that the proposed fil- 
tering procedure can be carried out reliably in less than 
50 ms on 256 x 256 grey- value images with uniform lin- 
ear motion blur. For the more general blur scenarios, 
the computational expense still prevents real-time per- 
formance in the single-threaded setting, although even 
here favourable run-times are achieved. 

Different image sizes. In Table [3] run times for the fas- 
test algorithm (ID Wiener, RRRL with box filter) are 



Fast and Robust Linear Motion Deblurring 



11 



shown for three different image sizes, all with the same 
blur kernel size. As can be seen, the measured run times 
for the RRRL iteration scale by a factor 3 ... 5 for each 
quadrupling of the image size, reflecting the essentially 
linear complexity. In contrast, averaged run times for 
the Wiener filter step (actually log-linear) multiply only 
by 2.5 . . . 3.5 for each scale step. 

This effect can be traced back to outliers with large 
execution times that occur in the Wiener filter step 
of some program runs and sometimes also in the first 
RRRL iterations. These outliers seem to be caused by 
cache effects and amount to an almost constant-time 
overhead on the average times, and thus the misleading 
impression of a sub-linear scaling of the Wiener filter 
step. Indeed, when for testing the Wiener filtering step 
is performed twice, the second step displays less outliers 
and thus a lower average. Similarly, the first RRRL it- 
eration in each program run is slightly slower than the 
subsequent iterations (about 0.5 ms for 256 x 256, about 
2 ms for 512 x 512). The influence of the outliers is 
particularly pronounced for 128 x 128 images, as in- 
dicated by the large standard deviations. It is evident 
that measurements for these short execution times are 
less reliable than for the larger images. 



5.2 Comparison to WYYZ and KF 

We test the KF and WYYZ algorithms on the same 
256 x 256 test case as the WR'^L algorithms. Results 
are shown in Table |4l As these algorithms are imple- 
mented in the full 2D setting, the proper WR^L value 
for comparison comes from the last row of Table [H 

The KF algorithm {a — 2/3) with analytic solver 
is tested in the /3 parameter regime recommended in 
PT| (/3 scaling in powers of VS from 1 to 256, one in- 
ner iteration per level) which totals to 6 iterations. The 
resulting run-time is in good agreement with the origi- 
nal value of 0.7 s from |11| when compensating for the 
different CPU clocks and image dimensions. 

For the WYYZ algorithm, we use the same parame- 
ter regime. Since WYYZ differs from KF just by having 
a cheap shrinkage step where the expensive polynomial 
equation is solved in KF, the so measured run times give 
a reliable lower bound to the run time that KF could 
reach with the lookup table instead of the polynomial 
solver. Note, however, that (TS] suggests finer scaling 
steps for /? and substantially more iterations per level. 

It can be seen from these figures that WR'^L, albeit 
not the fastest algorithm, is competitive in terms of 
speed. 



5.3 Parallel Implementations of Wiener+RRRL 

Multicore CPU computation. The gap between the run- 
times displayed in Table[2]for the general ID setting for 
images of size 256 x 256 and our goal of real-time com- 
putation is not very large. Therefore it is interesting to 
see whether this gap could be bridged by multithreaded 
computation on a recent multicore CPU. 

We tested this on the Phenom X6 hexacore machine 
described in the previous subsection. In our parallelised 
implementation of the RRRL component the regulari- 
sation was computed in the main thread, and the sharp- 
ening terms distributed to five threads such that all six 
cores could be used. Run-time results are shown in the 
first row of Table [5] In our test setting with just five 
iterations, the effective speed-up factor for the RRRL 
computation is only about 2. 

A closer look at the run-times of single iterations 
reveals that typically the first two to three iterations 
are slower than the following ones. In fact, the averages 
and standard deviations for the five iterations are (7.0± 
3.5) ms, (7.3±1.8)ms, (4.7±1.3)ms, (4.2±0.9)ms, and 
(3.9±1.0) ms, respectively. It appears that the overhead 
caused by creating and managing threads chips away a 
lot of the possible gain in performance for such short 
computational tasks. Indeed, when running much more 
iterations, the average computation time per RRRL it- 
eration stabilises in the range of (3 ... 4) ms. The work- 
load is well balanced between cores, as indicated by the 
fact that top consistently shows CPU loads of about 
580 % for the process. 

In spite of the modest speed-up factor, the multicore 
computation achieves the goal of performing WR'^L with 
5 iterations on 256 x 256 pixels within 50 ms. 

GPU computation. We tested our GPU implementa- 
tion of WR'^L for general 2D point-spread functions on 
an nVidia GT-440 graphics card featuring 96 cores at a 
clock rate of 1620 MHz. The net computation times of 
the Wiener filter and RRRL iterations, and the entire 
WR'^L are shown in the second row of Table [SJ 

Firstly, standard deviations of these figures are much 
lower than for the CPU computations. This can be at- 
tributed to the fact that there are almost no other pro- 
cesses in the system that exploit the computing capa- 
bilities of the graphics card, and could thus interfere 
with the computation. 

Secondly, the time measurements show that GPU 
computation enables the general 2D deconvolution of 
256 X 256-pixel images to be performed reliably under 
our real-time constraint (50 ms); even the computation 
for 512 X 512-pixel images appears feasible. 
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Real-time-capable GPU deconvolution for general 
2D blurs has already been devised in [TU] , compare also 
the non-blind deconvolution part in the framework of 
[8]. Both papers use the KF algorithm [11] with the 
same parameter regime that we have used in Section |3l 
We juxtapose therefore our speed measurements with 
those reported in [TU]. 

The average total run-time from Table [5] amounts 
to about 9.16 megapixels per second of single-channel 
computation, i.e. about 58.9 pixels per core per 10^ 
clock cycles. Since our implementation is not applica- 
ble to arbitrary image sizes, the test case from [T^ is 
the one where the image size optimally fits their algo- 
rithm. For this scenario, |10j reports about 13 megapix- 
els per second (single-channel). The value refers to an 
nVidia GTX 260 graphics card with 192 cores (accord- 
ing to nVidia's official specification; [TU] states 216) and 
a 1242 MHz clock, which means that about 54.5 pixels 
are processed per core per 10^ clock cycles. Given the 
neglection of other influence factors, this can only serve 
as a rough comparison, but it demonstrates that the 
computational efficiency of both approaches is in the 
same range. 

With its slight advantage in restoration quality in a 
number of settings demonstrated in Section [31 WR^L 
lends itself therefore as an attractive candidate also for 
general 2D deconvolution under real-time conditions on 
the GPU. 

Nevertheless, some words of care must be said. In 
tune with our decision to exclude loading and storing 
images from the time measurements, the transfer be- 
tween main memory and graphics card memory is not 
contained in the mentioned run-time figures. However, 
including these data transfers does not substantially 
change the picture. In our example, these transfers total 
to about 0.4 ms. 

Let us also revisit our other exclusion, precompu- 
tation of auxiliary data for Fourier transforms. In our 
CPU implementation, inclusion of these computations 
would make the computations more expensive, but not 
dramatically so. In the GPU setting with CUDA's built- 
in Fourier transform, the precomputation of Fourier 
transform plans is fairly expensive. We measured run- 
times of about 120 ms for this step; however, it is open 
to some question whether this time is exaggerated by in- 
cluding some initialisation overhead. Of course, for the 
price of the expensive Fourier plan construction we get 
a degree of generality that is not with our specialised 
CPU implementation. Very likely, using a more flexi- 
ble ofl[-the-shelf Fourier transform package on the CPU 
would lead to a similar shift in balance between pre- 
computation and actual image processing. At any rate, 
for the practical efficiency of the proposed GPU-based 



deconvolution it is crucial to ensure that precomputed 
data arc retained and used for multiple images. 

6 Summary and Outlook 

We have demonstrated the design of an efficient and ro- 
bust deconvolution algorithm for known space- invariant 
blur. By combining Wiener filtering as a first step with 
a small number of iterations of robust and regularised 
Richardson-Lucy deconvolution [TU] a reasonable de- 
convolution quality is achieved at a fairly low computa- 
tional expense. We improved this basic method by algo- 
rithmic optimisations for specific blur scenarios, in par- 
ticular fast box filtering [13j for uniform linear motion 
blur. In this case, real-time performance was reached 
for moderate image sizes in single-threaded CPU com- 
putation. To our knowledge, there has been no compa- 
rable framework so far for CPU-based real-time image 
deconvolution, even if it is only in a specific setting. 

Exemplary implementations demonstrated also that 
comparable real-time performance can also be achieved 
for general ID blur by multi-threaded computation on 
a contemporary multi-core CPU, and for 2D blur using 
GPU computation. 

Ongoing work is directed at further specific blur set- 
tings as well as a more systematic investigation of ef- 
ficient parallel implementations. Also, the comparison 
with existing real-time-capable GPU-based deconvolu- 
tion in terms of restoration quality and robustness de- 
serves further consideration. We expect that by these 
efforts the applicability of deconvolution in automati- 
sation, quality inspection and further application fields 
will be significantly improved. 
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Table 1 Signal-to-noise ratios (dB) for the blurred from Figs. ^ and [2] and their deblurred versions, including some methods 
and parameter settings not shown in the Figures. Parameters K for Wiener filter given in brackets apply also to WR^L. 
Parameters A for WYYZ given in brackets apply also to KF. All remaining parameters were fixed to the values mentioned in 
the captions of Figs. ^ and [J] Further details are described in the text. 



Test 


Blur- 


Wiener 


RL 


WYYZ 


KF 








WR^L 


image 


red 


filter (K) 


30 it. 


6 it. (A) 


6 it. 


5 it. 


30 it. 


100 it. 


5 it. 


Fig. Ha) 


9.87 


13.61 (0.006) 


15.37 


14.48 (50) 


14.48 


11.76 


14.93 


16.86 


14.14 


Fig. Ha) 


4.67 


7.05 (0.006) 


7.44 


6.60 (3) 


6.59 


5.72 


7.49 


8.17 


7.27 


Fig.EJe) 


9.58 


11.50 (0.06) 


11.73 


12.59 (25) 


12.54 


11.42 


13.63 


13.96 


12.92 


Fig. Eli) 


3.19 


3.09 (0.16) 


1.84 


3.49 (1) 


3.47 


6.46 


10.07 


13.19 


7.60 



Table 2 Run times for sharpening the image from Fig. 2] (256 x 256) with Wiener filter followed by 5 iterations of RRRL. 
Details see text. Boldface figures: average times ± estimated standard deviation, figures in brackets: minimum/maximum. 
Statistics from 100 program runs for each method. 



Algorithm 


Wiener 


Run t 
RRRL iteration 


me (ms) 

RRRL total 


Total 


ID with box filter 


3.8 ±2.3 
(2.8. ..12.2) 


4.8 ±0.3 
(4.3. . .6.2) 


24.3 ±0.5 
(22.8. . .26.8) 


28.1 ± 2.4 

(25.8. . .37.1) 


ID with Fourier 


3.1 ±0.7 
(2.8. . .8.3) 


10.9 ±0.3 
(10.2. ..11.9) 


54.6 ± 1.1 
(52.6. . .57.0) 


57.7± 1.5 
(55.6. . .64.9) 


2D with Fourier 


14.1 ± 4.1 
(11.7. . .27.3) 


20.5 ±0.3 
(20.0. . .21.4) 


102. 7± 0.5 

(101.5. ..103.8) 


116. 7± 4.1 

(113.2. ..130.0) 



Table 3 Run times for sharpening images of different sizes with ID Wiener filter followed by 5 iterations of RRRL with 
box filter algorithm. Details see text. Boldface figures: average times ± estimated standard deviation, figures in brackets: 
minimum/maximum. Statistics from 100 program runs for each image size. 



Image size 




Run time (ms) 






Wiener 


RRRL iteration 


RRRL total 


Total 


128 X 128 


1.5± 1.1 


1.4± 1.1 


7.3 ±3.4 


8.7 ± 4.4 




(0.6. . .3.4) 


(0.9. . .4.7) 


(4.6. ..15.8) 


(5.2. ..18.7) 


256 X 256 


3.8 ±2.3 


4.8 ±0.3 


24.3 ±0.5 


28.1 ± 2.4 




(2.8. . . 12.2) 


(4.3. . .6.2) 


(22.8. . .26.8) 


(25.8. . .37.1) 


512 X 512 


14.2 ±0.2 


24.5 ±0.9 


123.6 ±0.8 


137.8 ± 0.8 




(14.0. . .15.8) 


(23.6. . .27.1) 


(121.6. . . 126.4) 


(135.7. ..140.4) 



Table 4 Run times for sharpening the image from Fig. 3] (256 x 256) with different methods suitable for general 2D blurs. 
Details see text. Boldface figures: average times ± estimated standard deviation, figures in brackets: minimum/maximum. 
Statistics from 100 program runs for each method. 



Algorithm 


Run time (ms) 


KF with analytic solver, 6 iterations 
(parameter regime from 1111) 


372.2 ± 3.2 

(366.1. . .384.4) 


WYYZ, 6 iterations 

(parameter regime adapted to 1111) 


82.0 ± 2.1 

(73.2. . .88.8) 


WR'^L, 5 iterations 
(2D Fourier) 


116. 7± 4.1 

(113.2. ..130.0) 



Table 5 Run times for sharpening the image from Fig. 3] (256 x 256) with Wiener filter followed by 5 iterations of RRRL in 
exemplary parallel implementations. Details see text. Boldface figures: average times ± estimated standard deviation, figures 
in brackets: minimum/maximum. Statistics from 100 program runs for each method. 



Implementation 


Wiener 


Run tim 
RRRL iteration 


e (ms) 

RRRL total 


Total 


Multi-threaded CPU 
(ID Fourier) 


4.0 ± 1.7 
(2.9. ..13.4) 


5.4 ±2.4 
(2.3. ..17.3) 


27.8 ±5.6 
(17.9. ..42.7) 


31.8± 6.4 

(20.9. . .48.2) 


GPU (CUDA) 
(2D Fourier) 


0.35 ± <0.01 
(0.34. . .0.36) 


1.36 ±0.01 
(1.34. ..1.39) 


6.80 ±0.01 
(6.78. ..6.84) 


7.15 ±0.01 
(7.13. ..7.19) 



